Abstract

We present a theory for the growth dynamics of the World Wide Web 
that takes into account the wide range of stochastic growth rates in the 
number of pages per site, as well as the fact that new sites are created 
at different times. This leads to the prediction of a universal power law 
in the distribution of the number of pages per site which we confirm 
experimentally by analyzing data from large crawls made by the search 
engines Alexa and Infoseek. The existence of this power law not only 
implies the lack of any length scale for the Web, but also allows one to 
determine the expected number of sites of any given size without having 
to exhaustively crawl the Web. 
In a very short period, the World Wide Web (Web) has become one of the
most useful sources of information for a large part of the world's population. Its
exponential growth, as shown in Figure 1, from a few sites in 1994 to millions
today, has transformed it into an ecology of knowledge in which highly diverse
information is linked in an extremely complex and arbitrary fashion (1). Moreover,
several estimates of the total number of pages (2) indicate that, because of the
rapid growth of the Web, most search engines find only a fraction of all the
available sites (2).

[Figure 1: growth in the number of Web sites. Source: World Wide Web
Consortium, Mark Gray, Netcraft Server Survey]

In order to develop an evolutionary theory of the growth of the Web, we first
consider the number of pages belonging to a given site as a function of time.
Since pages within sites are typically organized in a hierarchical, tree-like fashion,
the number of pages added to a site at any given time will be proportional to
the number already existing there. Thus, if $n_s(t)$ is the number of pages belonging
to a site $s$ at time $t$, the number at the next interval of time, $n_s(t+1)$, is
determined by

$$n_s(t+1) = n_s(t) + g(t+1)\,n_s(t) \tag{1}$$

where $g(t)$ is the growth rate. Given the unpredictable character of site growth,
we assume that $g(t)$ fluctuates in an uncorrelated fashion from one time interval
to the next about a positive mean value $g_0$. In other words,

$$g(t) = g_0 + \xi(t) \tag{2}$$

with the fluctuations in growth, $\xi(t)$, behaving in such a way that
$\langle \xi(t) \rangle = 0$ and $\langle \xi(t)\,\xi(t+1) \rangle = 2\sigma^2\delta_{t,t+1}$,
i.e. they are delta correlated with zero mean. This assumption was confirmed by a
study of the growth of the Xerox Corp. Web site, whose fluctuations in growth are
plotted in Figure 2. Pearson's correlation test fails to reject, at the 0.05 level
(p-value 0.71), the hypothesis that the day-to-day fluctuations in the growth rate
are uncorrelated.

In order to obtain the distribution of pages per site, we sum Eq. (1) over time
steps to get Eq. (3).

Changing the sum to an integral (which assumes that the difference in the number
of pages between two time steps is small), we obtain Eq. (4).
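
Written out, these two steps plausibly read as follows. This is a reconstruction from Eq. (1), assuming the sum runs over time steps $0$ to $T-1$; it is a sketch of the algebra rather than the authors' exact typesetting.

```latex
% Divide Eq. (1) by n_s(t) and sum over the time steps (assumed range t = 0 .. T-1)
\sum_{t=0}^{T-1} \frac{n_s(t+1) - n_s(t)}{n_s(t)} = \sum_{t=0}^{T-1} g(t+1) \tag{3}

% For small relative increments the left-hand sum is approximated by an integral
\int_{n_s(0)}^{n_s(T)} \frac{dn_s}{n_s} = \ln \frac{n_s(T)}{n_s(0)} = \sum_{t=0}^{T-1} g(t+1) \tag{4}
```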

Notice that the right-hand side of Eq. (4) is a sum over discrete time steps, at
each of which we assume the values of $g$ to be normally distributed with mean $g_0$
and variance $\sigma^2$. This corresponds to a Brownian motion process with stationary
and independent increments. By invoking the Central Limit Theorem, we can
assert that for every time step $t$, the logarithm of $n_s$ is normally distributed
with mean $g_0 t$ and variance $\sigma^2 t$ (3, 4). This means that the distribution of
the number of pages for sites created at the same time and with the same average
growth rate is log-normal (5), i.e. its density is given by

where the time-dependent drift $g_0 t$ is the mean of $\ln n_s$, reflecting the fact
that as time goes on, more pages are added on average than deleted.
The variance of this distribution is related to the median $m = \exp(g_0 t)$ by
$\mathrm{Var}(n_s) = m^2 \exp(t\sigma^2)\,(\exp(t\sigma^2) - 1)$.
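
The log-normal result can be checked numerically by iterating Eq. (1) directly. The sketch below is illustrative only: the function name `simulate_log_sizes` and all parameter values are assumptions, every site starts with $n_s(0) = 1$, and the growth rate is drawn from a normal distribution as in Eq. (2). Because the steps are discrete, the sample mean of $\ln n_s$ falls slightly below $g_0 t$, which is the small-increment approximation made when the sum is replaced by an integral.

```python
import math
import random
import statistics

def simulate_log_sizes(n_sites, n_steps, g0, sigma, seed=0):
    """Iterate Eq. (1) for each site from n_s(0) = 1 and return ln n_s(T)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_sites):
        n = 1.0
        for _ in range(n_steps):
            g = rng.gauss(g0, sigma)  # Eq. (2): g(t) = g0 + xi(t)
            n += g * n                # Eq. (1): growth proportional to size
        logs.append(math.log(n))
    return logs

# Hypothetical parameters, chosen only for illustration
logs = simulate_log_sizes(n_sites=1000, n_steps=400, g0=0.01, sigma=0.02)
print("mean of ln n_s:    ", statistics.mean(logs), "vs g0*t =", 0.01 * 400)
print("variance of ln n_s:", statistics.variance(logs), "vs sigma^2*t =", 0.02**2 * 400)
```

A histogram of `logs` should look approximately Gaussian, which is what a log-normal distribution of $n_s$ means.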

Some insight into the dynamics of this growth can be obtained by noticing
that the stochastic differential equation associated with Eq. (1), which is given
by

$$\frac{dn_s}{dt} = [g_0 + \xi(t)]\,n_s \tag{6}$$

can be solved exactly (6). The solution is the stochastic growth process 

$$n_s(t) = n_s(0)\exp(g_0 t + w_t) \tag{7}$$

where $w_t$ is a Wiener process such that $\langle w_t \rangle = 0$ and $\langle w_t^2 \rangle = \sigma^2 t$.
Equation (7) shows that typical fluctuations in the growth of the number of
pages away from their mean rate $g_0$ relax exponentially to zero. On the other
hand, the $n$th moments of $n_s$, which are related to the probability of very unlikely
events, grow in time as $\langle n_s(t)^n \rangle = [n_s(0)]^n \exp[n(n - \sigma)g_0 t]$, indicating that
the evolutionary dynamics of the Web is dominated by occasional bursts in
which a large number of pages suddenly appear at a given site. These bursts are
responsible for the long tail of the probability distribution and make average
behavior depart from typical realizations (7).
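
The gap between average and typical behavior can be seen by sampling Eq. (7) directly. This is a minimal sketch under stated assumptions: `sample_sizes` is a hypothetical helper, $w_t$ is taken to be normal with mean zero and variance $\sigma^2 t$, and the parameter values are arbitrary. The median tracks the typical value $\exp(g_0 t)$, while the mean is pulled upward by rare, very large sites.

```python
import math
import random
import statistics

def sample_sizes(n_samples, t, g0, sigma, n0=1.0, seed=0):
    """Draw n_s(t) = n0 * exp(g0*t + w_t), assuming w_t ~ Normal(0, sigma^2 * t)."""
    rng = random.Random(seed)
    sd = sigma * math.sqrt(t)
    return [n0 * math.exp(g0 * t + rng.gauss(0.0, sd)) for _ in range(n_samples)]

# Hypothetical parameters: g0*t = 2 and sigma^2*t = 1
sizes = sample_sizes(n_samples=20000, t=100, g0=0.02, sigma=0.1)
print("median:", statistics.median(sizes), "vs exp(g0*t) =", math.exp(2.0))
print("mean:  ", statistics.mean(sizes))
```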

In order to consider the evolutionary dynamics of the whole Web, it is
important to notice that the distribution of the number of pages depends on the
time that has elapsed since the site was created. Since the number of sites in
the Web has doubled on average every six months, newer sites are more
numerous than older ones. Therefore the distribution of pages per site, for all
sites of a given growth rate regardless of age, is a mixture of log-normals given by
Equation (5), whose age parameter $t$ is weighted exponentially. Thus, in order
to obtain the true distribution of pages per site for sites that grow at the same
rate, we need to compute the mixture given by

which can be calculated analytically to give 

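$$P(n_s) = C\,n_s^{-\beta} \tag{9}$$

where the constant $C$ is given by $C = \lambda / \left(\sigma\sqrt{(g_0/\sigma)^2 + 2\lambda}\right)$,
$\lambda$ is the rate of the exponential weighting of site ages, and the exponent $\beta$
lies in the range $[1, \infty)$ and is determined by
$\beta = 1 - g_0/\sigma^2 + \sqrt{g_0^2 + 2\lambda\sigma^2}/\sigma^2$.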
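
The analytic result can be checked against a direct numerical integration of the mixture. The sketch below makes several assumptions: the mixture weights ages as $\lambda e^{-\lambda t}$, each site starts at $n_s(0) = 1$, the exponent is written with the positive convention $P(n_s) \propto n_s^{-\beta}$, and the function names and parameter values are hypothetical. The comparison applies for $n_s > 1$.

```python
import math

def lognormal_density(n, t, g0, sigma):
    """Density of n pages at age t: ln n ~ Normal(g0*t, sigma^2*t), with n_s(0) = 1."""
    var = sigma**2 * t
    return math.exp(-(math.log(n) - g0 * t) ** 2 / (2 * var)) / (n * math.sqrt(2 * math.pi * var))

def mixture_density(n, g0, sigma, lam, t_max=300.0, steps=100_000):
    """Midpoint-rule integral over ages t of lam*exp(-lam*t) times the log-normal density."""
    dt = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += lam * math.exp(-lam * t) * lognormal_density(n, t, g0, sigma) * dt
    return total

def power_law(n, g0, sigma, lam):
    """Closed form C * n**(-beta); returns (density, beta)."""
    root = math.sqrt(g0**2 + 2 * lam * sigma**2)
    beta = 1 - g0 / sigma**2 + root / sigma**2
    c = lam / root  # equal to lam / (sigma * sqrt((g0/sigma)^2 + 2*lam))
    return c * n ** (-beta), beta

# Hypothetical parameters, for illustration only
for n in (10.0, 100.0):
    exact, beta = power_law(n, g0=0.1, sigma=0.3, lam=0.2)
    print(n, mixture_density(n, g0=0.1, sigma=0.3, lam=0.2), exact, beta)
```

The two densities should agree closely at each $n_s$, which shows that exponentially weighting the ages of log-normally distributed sites yields a pure power law.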

Lastly, we need to take into account different growth rates for sites of the
Web, since the distribution given by Eq. (9) applies only to sites that have the
same growth rate $g = g(g_0, \sigma)$. Since each growth rate occurs with a particular
probability $P(g)$, and gives rise to a power law distribution in the number of
pages per site with a specific exponent, the probability that a given site with
an unknown growth rate has $n_s$ pages is given by the sum, over all growth rates
$g$, of the probability that the site has that many pages given $g$, multiplied by the
probability that a site's growth rate is $g$, i.e.

$$P(n_s) = \sum_i P(n_s \mid g_i)\,P(g_i) \tag{10}$$

Since we have already shown that each particular growth rate gives rise to a
power law distribution with a specific value of the exponent $\beta(g)$, this sum is of
the form

$$P(n_s) = \frac{c_1}{n_s^{\beta_1}} + \frac{c_2}{n_s^{\beta_2}} + \cdots + \frac{c_n}{n_s^{\beta_n}} \tag{11}$$

which, for large values of $n_s$, behaves like a power law with an exponent given
by the smallest power present in the series.
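
A quick numerical check illustrates why the smallest exponent wins. The weights and exponents below are hypothetical, and `local_slope` is an illustrative helper that measures the slope of $\log P$ against $\log n_s$ over one decade.

```python
import math

def mixture(n, terms):
    """Sum of c_i * n**(-beta_i); `terms` is a list of (c_i, beta_i) pairs."""
    return sum(c * n ** (-beta) for c, beta in terms)

def local_slope(n, terms, factor=10.0):
    """Slope of log P versus log n between n and factor * n."""
    return (math.log(mixture(factor * n, terms)) - math.log(mixture(n, terms))) / math.log(factor)

# Hypothetical weights and exponents; the smallest exponent is 1.8
terms = [(0.2, 1.8), (0.5, 2.5), (0.3, 3.2)]
for n in (1e1, 1e3, 1e6):
    print(n, local_slope(n, terms))
```

At small $n_s$ the slope is steeper than $-1.8$ because the faster-decaying terms still contribute. As $n_s$ grows, those terms become negligible and the slope approaches $-1.8$.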

We thus obtain the very general result that the evolutionary dynamics of
the World Wide Web gives rise to an asymptotic self-similar structure in which
there is no natural scale, with the number of pages per site distributed according
to a power law. This implies that on a log-log scale, the distribution of the number
of pages per site should fall on a straight line for large $n_s$.

In order to test this theory, we studied data generated by crawls of the
World Wide Web made by two search engines, Alexa (8) and Infoseek (9), which
covered 259,794 and 525,882 sites respectively. The plots in Figure 3 show the
probability of drawing at random from the sites in the crawls a site with a
given number of pages. Both data sets display a power law over several orders
of magnitude, with a drop-off at approximately $10^5$ pages, which is due to the
fact that crawlers don't systematically collect more pages per site than this
bound because of server limitations. Both the power law and the drop-off are
illustrated in Figure 3.

A linear regression on the variables log(number of sites) and log(number of
pages) yielded [1.647, 1.853] as the 95% confidence interval for the value of $\beta$ in
the Alexa crawl of 250,000 sites of the World Wide Web. For the Infoseek crawl,
the 95% confidence interval for $\beta$ is [1.775, 1.909]. These estimates for the value
of $\beta$ are consistent across the two data sets and with the model, which predicts
a linear dependence between the logarithms of the variables with
slope $-\beta < -1$.
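
The fitting step can be sketched as an ordinary least-squares fit on log-transformed data. The example below uses synthetic data, not the Alexa or Infoseek crawls, and `fit_power_law` is a hypothetical helper; it returns only a point estimate, whereas the confidence intervals quoted above would also require the standard error of the slope.

```python
import math
import random

def fit_power_law(sizes, probs):
    """Least-squares fit of log10(prob) on log10(size); returns (beta_hat, intercept)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(p) for p in probs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, my - slope * mx  # beta is minus the fitted slope

# Synthetic data: power law with exponent 1.8 and multiplicative noise (assumed values)
rng = random.Random(0)
sizes = [10 ** (1 + 0.1 * k) for k in range(40)]
probs = [0.5 * s ** -1.8 * math.exp(rng.gauss(0.0, 0.1)) for s in sizes]
beta_hat, intercept = fit_power_law(sizes, probs)
print("estimated beta:", beta_hat)
```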

The existence of this universal power law has practical consequences as well,
since one can estimate the expected number of sites of any arbitrary size, even
if a site of that size has not yet been observed. This can be achieved by
extrapolating the power law given by Eq. (9) to any large $n_s$, e.g.
$P(n_{s2}) = P(n_{s1})\,(n_{s1}/n_{s2})^{\beta}$. The expected number of sites of size
$n_{s2}$ in a crawl of $N$ sites would be $N\,P(n_{s2})$. As an example, from the Alexa
data we can infer that if one were to collect data on 250,000 sites, the probability
of finding a site with a million pages would be $10^{-4}$. Notice that this information
is not readily available from the crawl alone, since it stops at $10^5$ pages per site.
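
The extrapolation is a one-line calculation. The sketch below uses hypothetical inputs rather than values read from either crawl, and the function names are illustrative.

```python
def extrapolate_probability(p1, n1, n2, beta):
    """P(n2) = P(n1) * (n1 / n2)**beta, valid only inside the power-law regime."""
    return p1 * (n1 / n2) ** beta

def expected_sites(p, crawl_size):
    """Expected number of sites of a given size in a crawl of `crawl_size` sites."""
    return crawl_size * p

# Hypothetical inputs: an observed probability at 100 pages, extrapolated to 10,000 pages
p_small = 1e-2
p_large = extrapolate_probability(p_small, n1=1e2, n2=1e4, beta=1.75)
print(p_large, expected_sites(p_large, crawl_size=250_000))
```

The estimate is only as good as the fitted exponent, so uncertainty in $\beta$ is amplified when extrapolating across many orders of magnitude.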

Several points are worth making. First, since small values of $n_s$ lie outside
the scaling regime, our theory does not explain the data on sites with few pages.
Secondly, as a consequence of the universality of our prediction, the same power
law behavior will be seen as more sites are created. This will once again
allow for the determination of the largest sites from data that will be limited in
scope due to server limitations. Finally, since the process of ranking random
variables stemming from any broad distribution always produces a narrow and
monotonically decreasing power law of the type originally discussed by Zipf (10),
we expect that such ranking will lead to a Zipf-like law (11).

In summary, we presented a stochastic theory of the growth dynamics of 
the Web that takes into account the wide range of stochastic growth rates in 
the number of pages per site, as well as the fact that new sites are created at 
different times in the unfolding story of the Web. This leads to the prediction 
of a universal power law in the distribution of the number of pages per site, 
which we confirm experimentally by analyzing data from two large crawls by 
the search engines Alexa and Infoseek. The existence of this power law not only 
implies the lack of any length scale for the Web, but also allows one to estimate the
number of sites of any given size without having to exhaustively crawl the Web.
This is yet another example of the strong regularities (12) that are revealed in
studies of the Web, and which become apparent because of its sheer size and 
reach. 